Identifying the protein folding nucleus using molecular dynamics 
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Molecular dynamics simulations of folding in an 
off-lattice protein model reveal a nucleation sce- 
nario, in which a few well-defined contacts are 
formed with high probability in the transition 
state ensemble of conformations. Their appear- 
ance determines folding cooperativity and drives 
the model protein into its folded conformation. 

Thermodynamically, the folding transition in small 
proteins is analogous to a first-order transition whereby 
two thermodynamic states pj (folded and unfolded) are 
free energy minima while intermediate states are unsta- 
ble. The kinetic mechanism of transitions from the un- 
folded state to the folded state is nucleation Fold- 
ing nuclei can be defined as the minimal stable element of 
structure whose existence results in subsequent rapid as- 
sembly of the native state. This definition corresponds to 
a "post-critical nucleus" related to the first stable struc- 
tures that appear immediately after the transition state 
is overcome 0. The thermal probability of a transition 
state conformation is low compared to the folded and 
unfolded states, which are both accessible at the folding 
transition temperature TV (see Fig. |l|). 

Kinetic analyses [ll| for a number of lattice model 
chains of different lengths and degrees of sequence de- 
sign (optimization) point to a specific protein folding nu- 
cleus scenario. Passing through the transition state with 
subsequent rapid assembly of the native conformation re- 
quires the formation of some (small) number of specific 
obligatory contacts (protein folding nucleus). This result 
has been verified ^ for sequences designed in the lattice 
model using different sets of potentials, where it is shown 
that nucleus location was identical for two different se- 
quences designed with different potentials to fold into 
the same structure of a lattice 48-mer. This finding and 
related results ||[l2| suggest that the folding nucleus lo- 
cation depends more on the topology of the native struc- 
ture than on a particular sequence that folds into that 
structure. 

The dominance of geometrical/topological factors in 
the determination of the folding nucleus is a remark- 
able property that has evolutionary implications (see be- 
low). It is important to understand the physical origin 
of this property of folding proteins and assess its gen- 
erality. To this end, it is important to study other than 
lattice models and other than Monte-Carlo dynamic algo- 
rithms. Here we employ the discrete molecular dynamics 
(MD) simulation technique (the Go model Jl^-p^] with 
the square- well potential of the inter-residue interaction) 
to search for the nucleus in a continuous off-lattice model 

MM. 



The transition region 

Our proposed method to search for a folding nucleus is 
based on the observation H that equilibrium fluctuations 
around the native conformation can be separated into 
"local" unfolding (followed by immediate refolding) and 
"global" unfolding that leads to a transition into an un- 
folded state and requires longer time to refold. Local 
unfolding fluctuations are the ones that do not reach the 
top of the free energy barrier and, hence, are committed 
to moving quickly back to the native state. In contrast, 
global unfolding fluctuations are the ones that overcome 
the barrier and are committed to descend further to the 
unfolded state. Similarly, the fluctuations from the un- 
folded state can be separated into those that descend 
back to the unfolded state and those that result in pro- 
ductive folding. The difference between the two modes of 
fluctuation is whether or not the major free energy bar- 
rier is overcome. This means that the nucleation contacts 
(i. e. the ones that are formed on the "top" of the free 
energy barrier as the chain passes it upon folding) should 
be identified as contacts that are present in the "maxi- 
mally locally unfolded" conformations but are lost in the 
globally unfolded conformations of comparable energy. 

Thus, in order to identify the folding nucleus, we study 
the conformations of the 46-mer that appear in various 
kinds of folding ^ unfolding fluctuations. First, con- 
sider the time behavior of the potential energy at Tf (see 
Fig. |^a). The transition state conformations belong to 
the transition region TR from the folded state to the un- 
folded state that lies in the energy range { — 110 < E < 
—90} (see Fig. [[]). Region TR corresponds to the mini- 
mum of the histogram of the energy distribution. If we 
know the past and the future of a certain conformation 
that belongs to the TR, we can distinguish four types 
of such conformations (see Figs. || and ||a): (i) UU con- 
formations that originate in and return to the unfolded 
region without ascending to the folded region; (ii) FF 
conformations that originate in and return to the folded 
region without descending to the unfolded region; (in) 
UF conformations that originate in the unfolded region 
and descend to the folded region; and (iv) FU conforma- 
tions that originate in the folded region and descend to 
the unfolded region. 

If the nucleus exists, then the UF, FU, FF, and UU con- 
formations must have different properties depending on 
their history. For example, the difference between UF, 
FU, FF, and UU conformations is pronounced for the rms 
displacements from the native state of the residues in 
the vicinity of the residues 10 and 40 and is illustrated 
in Fig. ||b. One differ ence between the FF conformations 
and U U conformations is that the protein folding nucleus 
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is more likely to be retained in the FF conformations than 
in the UU conformations. The contacts belonging to the 
critical nucleus ("nucleation contacts") start appearing 
in the UF conformations, and start disappearing in the 
FU conformations, so that the frequencies of nucleation 
contacts in UF and FU conformations should be in be- 
tween FF and UU. 

Our goal is to select the contacts that are crucial for 
the folding unfolding transition. To this end we se- 
lect the contacts that appear much more often in the FF 
conformations than in the UU conformations. We dis- 
cover that if we set the threshold for the difference in 
contact frequencies between FF and UU conformations 
to be 0.2, then there are only five contacts that per- 
sist: (residue 11, residue 39), (10,40), (11,40), (10,41), 
and (11,41) (see Fig. ||3,d). These contacts can serve as 
evidence for the protein folding nucleus in the folding ^ 
unfolding transition in our model. 

Next, we demonstrate that these five selected contacts 
belong to the protein folding nucleus. Suppose we fix 
just one of them, e. g. (10,40), i. e. we impose a cova- 
lent ("permanent") link between residue 10 and residue 
40. If this contact belongs to the protein folding nucleus, 
its fixation by a covalent bond would eliminate the bar- 
rier between the folded and unfolded states, i. e. only 
the native basin of attraction will remain. Hence, we 
hypothesize that the cooperative transition between the 
unfolded and folded state will be eliminated and the en- 
ergy histogram (Fig. |l|) should change qualitatively from 
bimodal to unimodal. Our MD simulations support this 
hypothesis (Fig. ^): fixation of only one nucleation con- 
tact, (10,40), gives rise to a qualitative change in the 
energy distribution from bimodal to unimodal. Indeed, 
the probability to find an unfolded state with a fixed link, 
(10,40), which belongs to the protein folding nucleus, is 
drastically reduced compared to the probability of the 
unfolded state of the original 46-mer, indicating the im- 
portance of the selected contact (10,40). 

To provide a "control" that a specific contact plays 
such a dramatic role in changing the character of the 
energy landscape, we fix a randomly-chosen contact, 
(19, 37), which is not predicted by our analysis, to belong 
to the critical nucleus. Our hypothesis predicts no qual- 
itative change in the energy distribution histogram since 
the barrier, determined primarily by nucleation contacts, 
should not change dramatically for this control. Fig. |] 
shows that this is indeed the case. (The stability of the 
folded state is somewhat increased for the control because 
any preformed native contacts decrease the entropy of the 
unfolded state — i. e. they stabilize the folded state). 

We also find that for the UF conformation that the 
rms displacements of the residues from their native po- 
sitions are smaller than those for the FU conformations 
(Fig. |^b). This observation is consistent with the fact 
that the nucleation contacts are formed first upon enter- 
ing into productive folding and are destroyed last upon 



unfolding. 

Discussion 

Our main conclusion is that the existence of a few (ss 5) 
specific contacts is signature of the transition state con- 
formations. Those contacts can be defined as the pro- 
tein folding nucleus. Other contacts may also be present 
in transition state conformations. However, they are op- 
tional and vary from conformation to conformation, while 
nucleation contacts are present in transition state confor- 
mations with high probability. Formation of nucleation 
contacts can be considered as an obligatory step in the 
folding process: after they are formed the major barrier is 
overcome and subsequent folding proceeds "downhill" in 
the free energy landscape without encountering any fur- 
ther major free energy barriers. This is illustrated by our 
results that show that even one nucleation contact elim- 
inates the free energy barrier and, hence, leads to fast 
"downhill" motion to the folded state. As a control our 
results show that fixation of an arbitrary non-nucleation 
contact does not result in a similar effect. 

The protein folding nucleus scenario of the transition 
state was initially derived from Monte-Carlo studies of 
lattice models Hf^D and was consistent with protein en- 
gineering experiments with several small proteins Jl9| , p0| . 
Here, for the first time, we confirm this scenario in the off- 
lattice MD simulations. The consistency between conclu- 
sions made in different simulations (6|j^,^| and in experi- 
ments jl9],^0) is remarkable, and supports the possibility 
that the protein folding nucleus formation is a generic 
scenario to describe the protein folding transition state. 

Our present study buttresses the point that the loca- 
tion of a protein folding nucleus is determined by the 
geometry of the native state rather than the energet- 
ics of interactions in the native state (the two factors 
are not entirely independent, since native contacts must 
be generally more stable to provide stability to the na- 
tive conformation). In the present study, we used the 
Go model (where all native contacts have the same en- 
ergy). Nonetheless, it turns out that some contacts (nu- 
cleation) are "more equal than others" in terms of their 
role in shaping the free energy landscape of the chain and 
determining folding kinetics. This fact has implications 
for protein evolution, raising the possibility that proteins 
that have similar structures but different sequences may 
have similarly located protein folding nuclei. This pre- 
diction was verified for SH3 domains (2(]j2l|] and for cold- 
shock proteins [^2| . In terms of the evolutionary selection 
of protein sequences, the robustness of the folding nu- 
cleus suggests that any additional evolutionary pressure 
that controls the folding rate may have been applied se- 
lectively to nucleus residues, so that nucleation positions 
may have been under double (stability + kinetics) pres- 
sure in all proteins that fold into a given structure. Such 
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additional evolutionary pressure has indeed been found 
in the analysis of several protein superfamilies [Tc| ] . 



Methods 

We study a "beads on a string" model of a protein. We 
model the residues as hard spheres of unit mass. The 
potential of interaction between residues is "square- well" . 
We follow the Go model [p^|-^5[, where the attractive 
potential between residues is assigned to the pairs that 
are in contact in the native state and repulsive potential 
is assigned to the pairs that are not in contact in the 
native state. Thus, the potential energy is given by 
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where i and j denote residues i and j. Uij is the matrix 
of pair interactions 



~oo, \n —rj\ < a 
Uij = ( -Aije, a < \n -Tj\ < ax 



(2) 
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Here, ao/2 is the radius of the hard sphere, and a\/2 is 
the radius of the attractive sphere and e sets the energy 
scale. 1 1 All is a matrix of contacts with elements 



a "heat bath" . Thus, by changing the kinetic energy of 
those heat bath particles we are able to control the tem- 
perature of the environment. The heat bath particles 
are hard spheres of the same radii as the chain residues 
and have unit mass. Temperature is measured in units of 
e/kB- The variable time step is defined by the shortest 
time between two consecutive collisions. 
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protein is in the native conformation. Note that we pe- 
nalize the non- native contacts by imposing Ay < 0. The 
parameters are chosen as follows: e = 1, ao = 9.8 and 
ax — 19.5. The covalent bonds are also modeled by a 
square-well potential: 
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The values of bo = 9.9 and bx — 10.1 are chosen so that 
average covalent bond length is equal to 10. The original 
configuration of the protein (N = 46 residues) was de- 
signed by collapse of a homopolymer at low temperature 
p3| p5| . It contains n* = 212 native contacts, so the na- 
tive state energy Ens = —212. The radius of gyration of 
the globule in the native state is Rg ~ 20. The folding 
transition temperature Tf = 1.44 is determined by the 
location of the peak in the heat capacity dependence on 
temperature. 

Our simulations employ the discrete MD algorithm 
p6| p"8[ . To control the temperature of the protein we in- 
troduce 954 particles that interact with the protein and 
with each other via hard-core collisions and so serve as 
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FIG. 2. The schematic definition of the four types of con- 
formations: FF, UU, UF, and FU. 
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FIG. 1. The probability distribution of the energy states 
E of the 46-mer maintained at the folding transition tem- 
perature Tf = 1.44. The bimodal distribution indicates the 
presence of two dominant states: the folded (region F) and 
the unfolded (region U) states. The transition state ensemble 
belongs to region TR of the histogram { — 110 < E < —90}. 
The insets show typical conformations in the folded and un- 
folded regions. 
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FIG. 3. (a) The time evolution of the energy E of the 
46-mer maintained at the folding temperature Tf — 1.44. 
We focus on region TR: {-110 < E < -90}, which is de- 
noted by two red dash-dotted lines. The triangles with arrows 
pointing up, down, left and right denote the conformations 
FF, UU, UF and FU. The upper figure magnifies the region 
2-10 5 < t < 3-10 5 . (b) The average rms displacement of each 
residue from its native state for each of the four types of TR 
conformations: FF, UU, UF and FU. Note that there is a pro- 
nounced difference in rms displacement of the residues from 
their native state in the vicinity of the residues 10 and 40. 
The red dash-dotted line indicates the breaking point of the 
native contacts, i. e. when the rms displacement a is approxi- 
mately the size of the average relative distance between pairs 
of residues, i. e. a — (ao + «i)/2 ~ 15 (see "Methods" section 
for the definition of ao and ai). (c) The contact map of the 
model protein. The darker the shade of grey, the larger is the 
frequency of a contact. Above the diagonal of the square ma- 
trix shows the native contacts (see "Methods" section) of the 
FF conformations (if the native contact frequency is larger 
than 0.2). Below the diagonal of the square matrix shows 
the difference between the frequencies of the native contacts 
in FF and UU conformations (if this difference is larger than 
0.2). Five contacts that persist in the FF conformations — 
(11, 39), (10, 40), (11, 40), (10, 41), and (11, 41) — are marked 
by crosses. The figure shows that identification of the protein 
folding nucleus is facilitated by the method used to construct 
the region of the matrix below the diagonal, (d) The contact 
frequency difference of the FF conformations versus UU con- 
formations plotted versus the contact index k — — l)/2+j, 
where < j < i = 1,2,... 45. The dashed-dotted line indi- 
cates the threshold value for the frequency differences. The 
circles indicate the contacts whose frequency differences are 
larger than the threshold. (Due to the scale used, only four 
contacts can be seen in the figure.) 
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FIG. 4. (a) The probability distribution of the energy 
states E of the: (i) original 46-mer (at T/ = 1.44), (ii) 46-mcr 
(at T — 1.46) with a fixed contact belonging to the protein 
folding nucleus, (10,40), and (Hi) 46-mer (at T = 1.46) with 
fixed randomly-chosen control contact (19,37), which does 
not belong to the protein folding nucleus. Note that the prob- 
ability of the unfolded state of the 46-mer with a fixed contact 
belonging to the protein folding nucleus, is suppressed com- 
pared to that of the original 46-mer. (b) The time evolution 
of the energy E of (i) original (left) and (ii) fixed (10, 40) con- 
tact (right). Case (Hi) fixed (19,37) contact is similar to (i), 
so we do not show it. For case (i), the fluctuations are mostly 
between two extreme values of energy, corresponding to the 
folded and unfolded states. In contrast, for case (ii), the fluc- 
tuations are mostly around one energy value, corresponding 
to the folded state. 
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